Overview

Background

During the summer of 2012, wildfires ravaged the Algerian territory, covering most of the northern part of the country, especially the coastal cities. The disaster was driven by higher-than-average temperatures, which reached as high as 50 degrees Celsius.

Objectives

One important measure against the recurrence of such disasters is the ability to predict their occurrence. In this project, we will attempt to predict these forest fires based on multiple features related to weather indices.

Dataset Description

The dataset we will use to train and test our models consists of 244 observations from two Algerian Wilayas (provinces): Sidi-Bel Abbes and Bejaia. The observations were gathered over a period of four months, June to September 2012, for both regions.

The Dataset contains the following variables:

  1. Date: (DD/MM/YYYY) day, month (June to September), year (2012)
  2. Temp: maximum temperature at noon in degrees Celsius: 22 to 42
  3. RH: Relative Humidity in %: 21 to 90
  4. Ws: Wind speed in km/h: 6 to 29
  5. Rain: total daily rainfall in mm: 0 to 16.8
    FWI Components (check this LINK for more information)
  6. Fine Fuel Moisture Code (FFMC) index from the FWI system: 28.6 to 92.5
  7. Duff Moisture Code (DMC) index from the FWI system: 1.1 to 65.9
  8. Drought Code (DC) index from the FWI system: 7 to 220.4
  9. Initial Spread Index (ISI) index from the FWI system: 0 to 18.5
  10. Build-up Index (BUI) index from the FWI system: 1.1 to 68
  11. Fire Weather Index (FWI) Index: 0 to 31.1
  12. Classes: two classes, namely “fire” and “not fire”

Exploratory Data Analysis

We start by importing the necessary libraries for our analysis.

The libraries we used are the following:

  1. e1071: used for statistical and probabilistic algorithms; in our case, to fit the SVM
  2. MASS: includes many useful functions and example data, including functions for estimating linear models through generalized least squares (GLS)
  3. plyr: contains tools for splitting, applying, and combining data
  4. caret: a powerful library whose train function allows us to fit over 230 models, including tree-based models
  5. ROCR: a flexible tool for creating cutoff-parameterized 2D performance curves by freely combining two of over 25 performance measures
  6. pROC: a package specifically dedicated to ROC analysis
  7. randomForest: performs classification and regression on a forest of trees using random inputs
  8. gbm: short for generalized boosted models; provides extensions to Freund and Schapire's AdaBoost algorithm
  9. dplyr: provides a set of very useful functions for data manipulation
  10. tidyverse: contains multiple essential tools and packages, such as ggplot2 for visualization
  11. caTools: used for splitting our dataset into train/test sets
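
The setup chunk itself is not shown in the report; a minimal sketch, assuming each of the packages listed above has already been installed with install.packages():

```r
library(e1071)         # SVM
library(MASS)          # lda(), qda()
library(plyr)          # mapvalues()
library(caret)         # train(), confusionMatrix(), varImp()
library(ROCR)          # performance curves
library(pROC)          # roc(), auc()
library(randomForest)  # random forests
library(gbm)           # boosted models
library(dplyr)         # data manipulation, full_join()
library(tidyverse)     # ggplot2 and friends
library(caTools)       # sample.split()
```

Loading plyr before dplyr avoids plyr masking the dplyr verbs we use later.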

Importing the data

The Dataset provided to us was in the form of a .csv file that contained two tables, one table for the observations belonging to the Sidi-Bel Abbes region, and the other for Bejaia.

Before starting our analysis, we separated the tables into two distinct files by region, named Algerian_forest_fires_dataset_Bejaia.csv and Algerian_forest_fires_dataset_Sidi_Bel_Abbes.csv for Bejaia and Sidi-Bel Abbes respectively.

df_b <- read.csv("./Algerian_forest_fires_dataset_Bejaia.csv")
df_s <- read.csv("./Algerian_forest_fires_dataset_Sidi_Bel_Abbes.csv")

Cleaning and processing the data

We first check for missing values in the dataset; none were found.

colSums(is.na(df_b))
##         day       month        year Temperature          RH          Ws 
##           0           0           0           0           0           0 
##        Rain        FFMC         DMC          DC         ISI         BUI 
##           0           0           0           0           0           0 
##         FWI     Classes 
##           0           0
colSums(is.na(df_s))
##         day       month        year Temperature          RH          Ws 
##           0           0           0           0           0           0 
##        Rain        FFMC         DMC          DC         ISI         BUI 
##           0           0           0           0           0           0 
##         FWI     Classes 
##           0           0

We then proceed to add a column to both datasets indicating the region (Wilaya) of each observation. We chose the following encoding:

  1. Bejaia = 0
  2. Sidi-Bel Abbes = 1

After that, we merge both datasets into a single dataframe using full_join(); this will allow us to easily explore and analyze the data.

## Joining, by = c("day", "month", "year", "Temperature", "RH", "Ws", "Rain",
## "FFMC", "DMC", "DC", "ISI", "BUI", "FWI", "Classes", "Region")
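
The tagging and merge steps themselves are not shown above; a minimal sketch of the same two operations, using hypothetical two-row stand-ins for the real df_b and df_s:

```r
library(dplyr)

# Toy stand-ins for the two regional tables read in earlier
toy_b <- data.frame(Temperature = c(32, 30), Classes = c("fire", "not fire"))
toy_s <- data.frame(Temperature = c(35, 33), Classes = c("fire", "fire"))

# Tag each table with its region: Bejaia = 0, Sidi-Bel Abbes = 1
toy_b$Region <- 0
toy_s$Region <- 1

# full_join() keys on every shared column; with identical schemas and no
# matching rows it simply stacks the rows (hence the "Joining, by = ..."
# message above).
toy_all <- full_join(toy_b, toy_s)
nrow(toy_all)  # 4 rows: 2 from each region
```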

We check again for any NA values that might have been introduced by merging the two tables, and find one row containing NA values in the DC and FWI columns. We delete that row, since losing a single observation will not affect the overall dataset.

colSums(is.na(df))
##         day       month        year Temperature          RH          Ws 
##           0           0           0           0           0           0 
##        Rain        FFMC         DMC          DC         ISI         BUI 
##           0           0           0           1           0           0 
##         FWI     Classes      Region 
##           1           0           0
df = df %>% drop_na(DC)
dim(df)
## [1] 243  15

We now inspect the distinct values of the categorical variables, mainly the Classes and Region columns.

unique(df$Classes)
## [1] "not fire   "   "fire   "       "not fire     " "not fire    " 
## [5] "fire"          "fire "         "not fire"      "not fire "
unique(df$Region)
## [1] 1 0

We find that the Classes column contains values with unneeded space characters, so we trim those spaces.

df$Classes <- trimws(df$Classes, which = c("both"))

unique(df$Classes)
## [1] "not fire" "fire"
df = df %>% drop_na(Classes)

We then turn the fire/not fire values into 1/0 respectively for future analysis.

df$Classes <- mapvalues(df$Classes, from=c("not fire","fire"), to=c(0,1))
unique(df$Classes)
## [1] "0" "1"
df$Classes <- as.numeric(df$Classes)
str(df)
## 'data.frame':    243 obs. of  15 variables:
##  $ day        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ month      : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ year       : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
##  $ Temperature: int  32 30 29 30 32 35 35 28 27 30 ...
##  $ RH         : int  71 73 80 64 60 54 44 51 59 41 ...
##  $ Ws         : int  12 13 14 14 14 11 17 17 18 15 ...
##  $ Rain       : num  0.7 4 2 0 0.2 0.1 0.2 1.3 0.1 0 ...
##  $ FFMC       : num  57.1 55.7 48.7 79.4 77.1 83.7 85.6 71.4 78.1 89.4 ...
##  $ DMC        : num  2.5 2.7 2.2 5.2 6 8.4 9.9 7.7 8.5 13.3 ...
##  $ DC         : num  8.2 7.8 7.6 15.4 17.6 26.3 28.9 7.4 14.7 22.5 ...
##  $ ISI        : num  0.6 0.6 0.3 2.2 1.8 3.1 5.4 1.5 2.4 8.4 ...
##  $ BUI        : num  2.8 2.9 2.6 5.6 6.5 9.3 10.7 7.3 8.3 13.1 ...
##  $ FWI        : num  0.2 0.2 0.1 1 0.9 3.1 6 0.8 1.9 10 ...
##  $ Classes    : num  0 0 0 0 0 1 1 0 0 1 ...
##  $ Region     : num  1 1 1 1 1 1 1 1 1 1 ...

We delete the year column, since all observations were made in the same year. We then create a standardized copy of the dataframe, scaling every numeric predictor while leaving day, month, Classes and Region untouched.

df <- df[-c(3)]  # drop the year column (column 3)

df_scaled = df
# scale all numeric predictors; columns 1, 2, 13, 14 are day, month, Classes, Region
df_scaled[-c(1,2,13,14)] <- scale(df[-c(1,2,13,14)])
str(df_scaled)
## 'data.frame':    243 obs. of  14 variables:
##  $ day        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ month      : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ Temperature: num  -0.042 -0.593 -0.869 -0.593 -0.042 ...
##  $ RH         : num  0.604 0.739 1.211 0.132 -0.138 ...
##  $ Ws         : num  -1.243 -0.887 -0.531 -0.531 -0.531 ...
##  $ Rain       : num  -0.0314 1.6159 0.6175 -0.3809 -0.281 ...
##  $ FFMC       : num  -1.4455 -1.5431 -2.0309 0.1085 -0.0517 ...
##  $ DMC        : num  -0.983 -0.967 -1.007 -0.765 -0.7 ...
##  $ DC         : num  -0.865 -0.873 -0.878 -0.714 -0.668 ...
##  $ ISI        : num  -0.997 -0.997 -1.069 -0.612 -0.708 ...
##  $ BUI        : num  -0.976 -0.969 -0.99 -0.779 -0.716 ...
##  $ FWI        : num  -0.919 -0.919 -0.932 -0.811 -0.825 ...
##  $ Classes    : num  0 0 0 0 0 1 1 0 0 1 ...
##  $ Region     : num  1 1 1 1 1 1 1 1 1 1 ...

Visualizing the data

We have ended up with a clean and scaled dataframe named df_scaled, which we will use to visualize and further explore our data.

Our first instinct is to compare the two regions in terms of the number of fires and the average temperature.

aggregate(df$Classes ~ df$Region, FUN = sum)
aggregate(df$Temperature ~ df$Region, FUN = mean)

We used the unscaled dataset in order to plot the real-life temperature values.

df %>%
  group_by(Region) %>%
  summarise(Region = Region, Number_of_fires = sum(Classes), Temperature = mean(Temperature)) %>%
  ggplot(aes(x=Region, y=Number_of_fires, fill = Temperature))+
  geom_col(position='dodge')
## `summarise()` has grouped output by 'Region'. You can override using the
## `.groups` argument.

We can see that the Sidi-Bel Abbes region had both a greater total number of fires and a higher average temperature throughout the summer of 2012.

Further Analysis

Correlation Matrix

The previous results lead us to suspect a positive relationship between the temperature and the likelihood of a fire. However, we need to investigate all the other variables as well, which is why we will plot a correlation matrix of the features in the dataset.

corr_mat <- round(cor(df_scaled),2)
p_mat <- cor_pmat(df_scaled)
 
corr_mat <- ggcorrplot(
  corr_mat, 
  hc.order = FALSE, 
  type = "upper",
  outline.col = "white",
)
 
ggplotly(corr_mat)

Feature Selection

We performed feature selection using the caret package to determine which features are the most and least important.

In this case, we opted for Linear Discriminant Analysis with Stepwise Feature Selection by specifying stepLDA as our method.

The varImp function returns a measure of importance out of 100 for each of the features. According to the official Caret documentation, the importance metric is calculated by conducting a ROC curve analysis on each predictor; a series of cutoffs is applied to the predictor data to predict the class. The AUC is then computed and is used as a measure of variable importance.

NOTE: we hid the output of the above chunk because it produced a long console message; in the end it resulted in the following figure. To see the original output, remove the include=FALSE chunk option.
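
The hidden chunk presumably resembles the following sketch (object names and control settings here are assumptions, not verbatim; caret's stepLDA method additionally requires the klaR package):

```r
# Sketch of the hidden feature-selection chunk. Classes must be a
# factor for caret to treat this as a classification problem.
fs_data <- df_scaled
fs_data$Classes <- as.factor(fs_data$Classes)

# LDA with stepwise feature selection; this prints a long search
# trace, which is why the chunk was hidden with include=FALSE.
step_model <- train(Classes ~ ., data = fs_data,
                    method = "stepLDA",
                    trControl = trainControl(method = "cv", number = 10))

# varImp() scores each predictor from 0 to 100 using the per-predictor
# ROC analysis described above.
plot(varImp(step_model))
```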

We can see that the variables month, Ws, Region, and day are insignificant compared to the other features, so we will disregard them in our model. To determine this, we used a threshold of 0.7 on the importance measure.

Model Building

For the following models, we will only use the features that were the most significant in our feature selection phase. The selected features are:

  1. Temperature
  2. Rain
  3. FFMC
  4. DMC
  5. DC
  6. ISI
  7. BUI
  8. FWI
  9. RH

Splitting the dataset

We begin by splitting the data into train/test sets with an 80/20 split, a common default choice. This leaves us with 190 observations in the training set and 53 in the test set. Given the small size of the dataset at hand, we will later apply cross-validation to some models in order to further examine their performance and compare them with each other.

We set a seed of 40 for reproducibility.

set.seed(40)
split <- sample.split(df_scaled, SplitRatio=0.8)

train_set <- subset(df_scaled, split == "TRUE")
test_set <- subset(df_scaled, split=="FALSE")

dim(train_set)
## [1] 190  14
dim(test_set)
## [1] 53 14

Logistic Regression

Logistic Regression can be considered an extension of Linear Regression in which we predict a qualitative response for an observation. It gives us the probability that an observation belongs to a class in binary classification, and can also be extended to multiclass problems.
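
Concretely, in our binary setting the model estimates the class-1 ("fire") probability by passing a linear combination of the predictors through the logistic (sigmoid) function:

$$
P(\text{fire} \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)}}
$$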

Training the model

We first fit our model on the training set. As we do so, we get a warning that the model did not converge; this is due to our model being able to perfectly split the dataset into positive/negative observations. This might sound counterintuitive, but in this case the warning is a good sign.

logistic_model <- glm(Classes ~ Temperature+Rain+FFMC+DMC+DC+ISI+BUI+FWI+RH, data=train_set, family="binomial")

logistic_model
## 
## Call:  glm(formula = Classes ~ Temperature + Rain + FFMC + DMC + DC + 
##     ISI + BUI + FWI + RH, family = "binomial", data = train_set)
## 
## Coefficients:
## (Intercept)  Temperature         Rain         FFMC          DMC           DC  
##      61.475      -19.741       96.119      170.010       61.541       54.960  
##         ISI          BUI          FWI           RH  
##     783.342      103.168     -669.537       -2.729  
## 
## Degrees of Freedom: 189 Total (i.e. Null);  180 Residual
## Null Deviance:       258.6 
## Residual Deviance: 1.128e-07     AIC: 20

Testing the model

Since logistic regression gives us the probability of each observation belonging to class 1, we will use a 0.5 threshold to transform that probability into a classification of either 0 or 1.

After getting our predictions, we will use the confusionMatrix function from the caret library, which computes a set of performance metrics including F1-score, recall and precision. Other metrics computed include sensitivity, specificity, prevalence, etc. The official documentation for this function and the formulas for all metrics are found in this link. We will only be interested in the F1-score, recall, precision, accuracy and balanced accuracy.
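
For reference, the metrics we focus on are computed from the confusion-matrix counts (TP, FP, FN, TN) as:

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

Balanced accuracy is the mean of sensitivity and specificity.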

On the training set

Our model gives us an accuracy and an f1 score of 100% on the training set.

preds_logistic <- predict(logistic_model, train_set, type="response")

preds_logistic <- ifelse(preds_logistic >0.5,1,0)
preds_logistic <- as.factor(preds_logistic)

confusionMatrix(preds_logistic, train_set$Classes,
                mode = "everything",
                positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  80   0
##          1   0 110
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9808, 1)
##     No Information Rate : 0.5789     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##               Precision : 1.0000     
##                  Recall : 1.0000     
##                      F1 : 1.0000     
##              Prevalence : 0.5789     
##          Detection Rate : 0.5789     
##    Detection Prevalence : 0.5789     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 1          
## 

On the testing set

On the test set however, we get an accuracy of 88.68% and an f1 score of 88.89%.

train_set$Classes <- as.factor(train_set$Classes)
test_set$Classes <- as.factor(test_set$Classes)

preds_logistic <- predict(logistic_model, test_set, type="response")

preds_logistic <- ifelse(preds_logistic >0.5,1,0)
preds_logistic <- as.factor(preds_logistic)

confusionMatrix(preds_logistic, test_set$Classes,
                mode = "everything",
                positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 23  3
##          1  3 24
##                                           
##                Accuracy : 0.8868          
##                  95% CI : (0.7697, 0.9573)
##     No Information Rate : 0.5094          
##     P-Value [Acc > NIR] : 6.266e-09       
##                                           
##                   Kappa : 0.7735          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8889          
##             Specificity : 0.8846          
##          Pos Pred Value : 0.8889          
##          Neg Pred Value : 0.8846          
##               Precision : 0.8889          
##                  Recall : 0.8889          
##                      F1 : 0.8889          
##              Prevalence : 0.5094          
##          Detection Rate : 0.4528          
##    Detection Prevalence : 0.5094          
##       Balanced Accuracy : 0.8868          
##                                           
##        'Positive' Class : 1               
## 

Plotting the ROC curve

As we plot the ROC curve, we can see that the AUC is equal to 88.67%.

## [1] "The AUC is:  0.886752136752137"
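
The plotting chunk itself is hidden in the report; a minimal sketch using pROC, assuming the logistic_model and test_set objects defined above:

```r
# Predicted probabilities on the test set
probs <- predict(logistic_model, test_set, type = "response")

# Build the ROC curve and report the area under it
roc_obj <- roc(response = test_set$Classes, predictor = probs)
plot(roc_obj, main = "ROC curve - Logistic Regression")
print(paste("The AUC is: ", auc(roc_obj)))
```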

LDA

Linear Discriminant Analysis is best used when the decision boundary of our given dataset is assumed to be linear. There are two basic assumptions that LDA takes into consideration:

  1. There is a common variance across all response classes
  2. The distribution of observations in each response class is normal with a class-specific mean, and a common variance

Since LDA assumes that the input variables share a common variance, we will use the standardized data-frame in the train/test splits. Each variable in the standardized data-frame has a mean of 0 and a variance of 1.
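
Under these assumptions, an observation x is assigned to the class k with the largest linear discriminant score, where mu_k is the class mean, Sigma the shared covariance matrix, and pi_k the class prior:

$$
\delta_k(x) = x^{T}\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^{T}\Sigma^{-1}\mu_k + \log \pi_k
$$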

Training the model

lda_model = lda(Classes ~ Temperature+Rain+FFMC+DMC+DC+ISI+BUI+FWI+RH, data=train_set, family="binomial")
lda_model
## Call:
## lda(Classes ~ Temperature + Rain + FFMC + DMC + DC + ISI + BUI + 
##     FWI + RH, data = train_set, family = "binomial")
## 
## Prior probabilities of groups:
##         0         1 
## 0.4210526 0.5789474 
## 
## Group means:
##   Temperature       Rain       FFMC        DMC         DC        ISI        BUI
## 0  -0.6311297  0.2187927 -0.8191764 -0.6641356 -0.5741743 -0.8196185 -0.6678032
## 1   0.4792246 -0.3259588  0.6827776  0.5093237  0.4581167  0.6852537  0.5153841
##          FWI         RH
## 0 -0.8109786  0.5342098
## 1  0.6595379 -0.4159940
## 
## Coefficients of linear discriminants:
##                    LD1
## Temperature  0.1278525
## Rain         0.3075713
## FFMC         1.4045714
## DMC         -1.2553968
## DC          -0.3251614
## ISI          0.3967790
## BUI          1.1971748
## FWI          0.8638929
## RH           0.5758128

Testing the model

On the training set

On our training data, the model reached an accuracy of 95.79% and an f1 score of 96.40%.

preds_lda = predict(lda_model,train_set, type="response")
confusionMatrix(preds_lda$class, train_set$Classes,
                mode = "everything",
                positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  75   3
##          1   5 107
##                                           
##                Accuracy : 0.9579          
##                  95% CI : (0.9187, 0.9816)
##     No Information Rate : 0.5789          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9133          
##                                           
##  Mcnemar's Test P-Value : 0.7237          
##                                           
##             Sensitivity : 0.9727          
##             Specificity : 0.9375          
##          Pos Pred Value : 0.9554          
##          Neg Pred Value : 0.9615          
##               Precision : 0.9554          
##                  Recall : 0.9727          
##                      F1 : 0.9640          
##              Prevalence : 0.5789          
##          Detection Rate : 0.5632          
##    Detection Prevalence : 0.5895          
##       Balanced Accuracy : 0.9551          
##                                           
##        'Positive' Class : 1               
## 

On the testing set

As we can see below, the number of false negatives is 1. Our model also yielded an accuracy of 98.11% and an f1 score of 98.11%.

preds_lda = predict(lda_model,test_set, type="response")
confusionMatrix(preds_lda$class, test_set$Classes,
                mode = "everything",
                positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 26  1
##          1  0 26
##                                           
##                Accuracy : 0.9811          
##                  95% CI : (0.8993, 0.9995)
##     No Information Rate : 0.5094          
##     P-Value [Acc > NIR] : 1.556e-14       
##                                           
##                   Kappa : 0.9623          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9630          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9630          
##               Precision : 1.0000          
##                  Recall : 0.9630          
##                      F1 : 0.9811          
##              Prevalence : 0.5094          
##          Detection Rate : 0.4906          
##    Detection Prevalence : 0.4906          
##       Balanced Accuracy : 0.9815          
##                                           
##        'Positive' Class : 1               
## 

Plotting the ROC curve

The AUC for LDA was 98.14%, considerably higher than the 88.67% obtained by Logistic Regression.

## [1] "The AUC is:  0.981481481481481"

QDA

Quadratic Discriminant Analysis is best used when the decision boundary of our given dataset is assumed to be non-linear. Similarly to LDA, QDA makes two basic assumptions:

  1. There is a different covariance for each of the response classes
  2. The distribution of observations in each response class is normal with a class-specific mean, and a class-specific covariance
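
Because each class k keeps its own covariance matrix Sigma_k, the discriminant score becomes quadratic in x:

$$
\delta_k(x) = -\frac{1}{2}\log\lvert\Sigma_k\rvert - \frac{1}{2}(x - \mu_k)^{T}\Sigma_k^{-1}(x - \mu_k) + \log \pi_k
$$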

Training the model

qda_model = qda(Classes ~ Temperature+Rain+FFMC+DMC+DC+ISI+BUI+FWI+RH, data=train_set)
qda_model
## Call:
## qda(Classes ~ Temperature + Rain + FFMC + DMC + DC + ISI + BUI + 
##     FWI + RH, data = train_set)
## 
## Prior probabilities of groups:
##         0         1 
## 0.4210526 0.5789474 
## 
## Group means:
##   Temperature       Rain       FFMC        DMC         DC        ISI        BUI
## 0  -0.6311297  0.2187927 -0.8191764 -0.6641356 -0.5741743 -0.8196185 -0.6678032
## 1   0.4792246 -0.3259588  0.6827776  0.5093237  0.4581167  0.6852537  0.5153841
##          FWI         RH
## 0 -0.8109786  0.5342098
## 1  0.6595379 -0.4159940

Testing the model

On the training set

Our model yields an accuracy of 98.42% and an f1 score of 98.64% on the training set.

preds_qda = predict(qda_model,train_set, type="response")
confusionMatrix(preds_qda$class, train_set$Classes,
                mode = "everything",
                positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  78   1
##          1   2 109
##                                           
##                Accuracy : 0.9842          
##                  95% CI : (0.9546, 0.9967)
##     No Information Rate : 0.5789          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9676          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9909          
##             Specificity : 0.9750          
##          Pos Pred Value : 0.9820          
##          Neg Pred Value : 0.9873          
##               Precision : 0.9820          
##                  Recall : 0.9909          
##                      F1 : 0.9864          
##              Prevalence : 0.5789          
##          Detection Rate : 0.5737          
##    Detection Prevalence : 0.5842          
##       Balanced Accuracy : 0.9830          
##                                           
##        'Positive' Class : 1               
## 
On the testing set

As we can see below our model yielded an f1-score of 94.55% and an accuracy of 94.34%.

preds_qda = predict(qda_model,test_set, type="response")
confusionMatrix(preds_qda$class, test_set$Classes,
                mode = "everything",
                positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 24  1
##          1  2 26
##                                           
##                Accuracy : 0.9434          
##                  95% CI : (0.8434, 0.9882)
##     No Information Rate : 0.5094          
##     P-Value [Acc > NIR] : 6.652e-12       
##                                           
##                   Kappa : 0.8867          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9630          
##             Specificity : 0.9231          
##          Pos Pred Value : 0.9286          
##          Neg Pred Value : 0.9600          
##               Precision : 0.9286          
##                  Recall : 0.9630          
##                      F1 : 0.9455          
##              Prevalence : 0.5094          
##          Detection Rate : 0.4906          
##    Detection Prevalence : 0.5283          
##       Balanced Accuracy : 0.9430          
##                                           
##        'Positive' Class : 1               
## 

Plotting the ROC curve

After plotting the ROC curve, we got an AUC of 94.30%, which is lower than LDA's but higher than Logistic Regression's.

## [1] "The AUC is:  0.943019943019943"

We can observe that QDA performs better than LDA on the training data because it has a tendency to over-fit it. However, LDA performs better on the testing data, since it generalizes better to unseen data points.

KNN Classifier

In this section, we will explore KNN's performance on our problem. We will use hyperparameter tuning to determine the best number of nearest neighbors (k), and we will also use repeated cross-validation during training for a better performance estimate.

Since KNN is a distance-based model, we will here again use our standardized dataset instead of the original.

Training the model

Setting up the Cross-Validation for Hyperparameter tuning

The summaryFunction argument determines which metrics are used to evaluate each hyperparameter setting. Here we use defaultSummary, which calculates the accuracy and the Kappa statistic.

We have opted for 10-fold cross-validation repeated 10 times. The classProbs parameter is set to TRUE, so we can set the threshold later when we test our model's performance.

training_control <- trainControl(method = "repeatedcv",
                                 summaryFunction = defaultSummary,
                                 classProbs = TRUE,
                                 number = 10,
                                 repeats = 10)
Training with Cross-validation

Now we use the train() function to train the model and tune the k hyperparameter. The range of k is from 3 to 85 in steps of 2, so we only try odd values of k, as is best practice for KNN classification: an odd k avoids ties in binary classification.

Another tweak we need to make to our dataset is to change the target variable's values into valid R variable names; this is required when working with class probabilities, since each value of the target variable becomes a variable holding its own probability values. Leaving the values as {0,1} would throw an error, so we map the Classes values back to 'fire' and 'not_fire' and proceed.

train_set_knn <- train_set
test_set_knn <- test_set
train_set_knn$Classes <- mapvalues(train_set$Classes, from=c(0,1), to=c("not_fire","fire"))
test_set_knn$Classes <- mapvalues(test_set$Classes, from=c(0,1), to=c("not_fire","fire"))
knn_cv <- train(Classes ~ Temperature+Rain+FFMC+DMC+DC+ISI+BUI+FWI+RH, 
                data = train_set_knn,
                method = "knn",
                trControl = training_control,
                metric = "Accuracy",
                tuneGrid = data.frame(k = seq(3, 85, by = 2)))
knn_cv
## k-Nearest Neighbors 
## 
## 190 samples
##   9 predictor
##   2 classes: 'not_fire', 'fire' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 171, 171, 171, 171, 171, 171, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    3  0.9326316  0.8616948
##    5  0.9305263  0.8576177
##    7  0.9342105  0.8647624
##    9  0.9289474  0.8530909
##   11  0.9242105  0.8430160
##   13  0.9215789  0.8376467
##   15  0.9173684  0.8288136
##   17  0.9268421  0.8489962
##   19  0.9226316  0.8409950
##   21  0.9200000  0.8347289
##   23  0.9136842  0.8215274
##   25  0.9121053  0.8182512
##   27  0.9142105  0.8232395
##   29  0.9163158  0.8272735
##   31  0.9178947  0.8305997
##   33  0.9189474  0.8328290
##   35  0.9242105  0.8439728
##   37  0.9231579  0.8417665
##   39  0.9247368  0.8450947
##   41  0.9289474  0.8538833
##   43  0.9310526  0.8586862
##   45  0.9268421  0.8498246
##   47  0.9268421  0.8499320
##   49  0.9268421  0.8500925
##   51  0.9289474  0.8544426
##   53  0.9294737  0.8552139
##   55  0.9310526  0.8585060
##   57  0.9315789  0.8597132
##   59  0.9352632  0.8678480
##   61  0.9336842  0.8648453
##   63  0.9310526  0.8593945
##   65  0.9294737  0.8561021
##   67  0.9252632  0.8476749
##   69  0.9236842  0.8444531
##   71  0.9210526  0.8391447
##   73  0.9173684  0.8316771
##   75  0.9142105  0.8255176
##   77  0.9168421  0.8310081
##   79  0.9126316  0.8223721
##   81  0.9105263  0.8177512
##   83  0.9068421  0.8105702
##   85  0.9063158  0.8091374
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 59.
Distribution of predicted probabilities (threshold inspection)

Inspecting the probabilities reveals that a cutoff around 0.5 gives the best classification results. The predict function uses a 0.5 cutoff by default, so we do not need to change it.

preds_knn = predict(knn_cv,train_set_knn, type = "prob")
train_set %>%
  ggplot() +
  aes(x = preds_knn$fire, fill = Classes) +
  geom_histogram(bins = 20) +
  labs(x = "Probability", y = "Count", title = "Distribution of predicted probabilities for value fire" )

Testing the model

When testing our model on the test set however, we get an accuracy of 90.57% and an f1 score of 90.91%

preds_knn = predict(knn_cv,test_set_knn, type="raw")
confusionMatrix(preds_knn, test_set_knn$Classes,
                mode = "everything",
                positive="fire")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction not_fire fire
##   not_fire       23    2
##   fire            3   25
##                                           
##                Accuracy : 0.9057          
##                  95% CI : (0.7934, 0.9687)
##     No Information Rate : 0.5094          
##     P-Value [Acc > NIR] : 7.924e-10       
##                                           
##                   Kappa : 0.8111          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9259          
##             Specificity : 0.8846          
##          Pos Pred Value : 0.8929          
##          Neg Pred Value : 0.9200          
##               Precision : 0.8929          
##                  Recall : 0.9259          
##                      F1 : 0.9091          
##              Prevalence : 0.5094          
##          Detection Rate : 0.4717          
##    Detection Prevalence : 0.5283          
##       Balanced Accuracy : 0.9053          
##                                           
##        'Positive' Class : fire            
## 

Plotting the ROC curve

After plotting the ROC curve, we get an AUC of 90.52%.

## [1] "The AUC is:  0.905270655270655"

Ensemble Methods / Tree based methods

Simple Decision Trees and tree pruning

The goal of ensemble modeling is to improve performance over a baseline model by combining multiple models. We therefore first establish a baseline with a single algorithm: in our case, a simple decision tree.

Decision trees are widely used classifiers in industry thanks to the transparency of the rules that lead to a prediction. They are arranged in a hierarchical, tree-like structure and are simple to understand and interpret. They are not very susceptible to outliers and can capture nonlinear relationships.

We will be using the rpart library for creating decision trees. rpart stands for recursive partitioning and employs the CART (classification and regression trees) algorithm. Apart from rpart, there are many other decision tree libraries, such as C50, party, tree, and maptree.

library(rpart)
library(rpart.plot)

Training the model without pruning

Next, we create a decision tree model by calling the rpart function. Let’s first create a base model with default parameter values. Notice that we do not include any train control, meaning that we are not using bagging, cross-validation, or pruning. The resulting tree is a simple decision tree that splits our observations on the ISI variable. We will explore the performance of the model on the train and test sets next.

base_model <- rpart(Classes ~ Temperature+Rain+FFMC+DMC+DC+ISI+BUI+FWI+RH, data = train_set, method = "class")
summary(base_model)
## Call:
## rpart(formula = Classes ~ Temperature + Rain + FFMC + DMC + DC + 
##     ISI + BUI + FWI + RH, data = train_set, method = "class")
##   n= 190 
## 
##       CP nsplit rel error xerror       xstd
## 1 0.9625      0    1.0000 1.0000 0.08506963
## 2 0.0100      1    0.0375 0.0875 0.03245696
## 
## Variable importance
##  ISI FFMC  FWI  DMC  BUI   DC 
##   21   20   18   15   14   13 
## 
## Node number 1: 190 observations,    complexity param=0.9625
##   predicted class=1  expected loss=0.4210526  P(node) =1
##     class counts:    80   110
##    probabilities: 0.421 0.579 
##   left son=2 (83 obs) right son=3 (107 obs)
##   Primary splits:
##       ISI  < -0.4073884 to the left,  improve=86.84845, (0 missing)
##       FFMC < 0.1608133  to the left,  improve=86.79087, (0 missing)
##       FWI  < -0.4751507 to the left,  improve=72.30889, (0 missing)
##       DMC  < -0.5269618 to the left,  improve=49.40724, (0 missing)
##       BUI  < -0.6424139 to the left,  improve=46.02645, (0 missing)
##   Surrogate splits:
##       FFMC < 0.2967052  to the left,  agree=0.984, adj=0.964, (0 split)
##       FWI  < -0.5557897 to the left,  agree=0.947, adj=0.880, (0 split)
##       DMC  < -0.5269618 to the left,  agree=0.874, adj=0.711, (0 split)
##       BUI  < -0.5615897 to the left,  agree=0.858, adj=0.675, (0 split)
##       DC   < -0.6667463 to the left,  agree=0.832, adj=0.614, (0 split)
## 
## Node number 2: 83 observations
##   predicted class=0  expected loss=0.03614458  P(node) =0.4368421
##     class counts:    80     3
##    probabilities: 0.964 0.036 
## 
## Node number 3: 107 observations
##   predicted class=1  expected loss=0  P(node) =0.5631579
##     class counts:     0   107
##    probabilities: 0.000 1.000
#Plot Decision Tree
rpart.plot(base_model)

Testing the unpruned model

On the training set

After exploring the confusion matrix and the different performance metrics, we can see that our base decision tree does not fit the data perfectly: it makes 3 misclassifications on the training set. These 3 false negatives bring the model’s accuracy to 98.42% and the F1 score to 98.62%.

preds_unpruned= predict(base_model, train_set, type="class")
confusionMatrix(preds_unpruned, train_set$Classes, mode = "everything", positive='1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  80   3
##          1   0 107
##                                           
##                Accuracy : 0.9842          
##                  95% CI : (0.9546, 0.9967)
##     No Information Rate : 0.5789          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9678          
##                                           
##  Mcnemar's Test P-Value : 0.2482          
##                                           
##             Sensitivity : 0.9727          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9639          
##               Precision : 1.0000          
##                  Recall : 0.9727          
##                      F1 : 0.9862          
##              Prevalence : 0.5789          
##          Detection Rate : 0.5632          
##    Detection Prevalence : 0.5632          
##       Balanced Accuracy : 0.9864          
##                                           
##        'Positive' Class : 1               
## 

On the testing set

Our base decision tree performs very well on unseen data, with an accuracy of 92.45% and an F1 score of 92%.

preds_unpruned = predict(base_model,test_set, type="class")
confusionMatrix(preds_unpruned, test_set$Classes,
                mode = "everything",
                positive='1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 26  4
##          1  0 23
##                                           
##                Accuracy : 0.9245          
##                  95% CI : (0.8179, 0.9791)
##     No Information Rate : 0.5094          
##     P-Value [Acc > NIR] : 8.194e-11       
##                                           
##                   Kappa : 0.8494          
##                                           
##  Mcnemar's Test P-Value : 0.1336          
##                                           
##             Sensitivity : 0.8519          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.8667          
##               Precision : 1.0000          
##                  Recall : 0.8519          
##                      F1 : 0.9200          
##              Prevalence : 0.5094          
##          Detection Rate : 0.4340          
##    Detection Prevalence : 0.4340          
##       Balanced Accuracy : 0.9259          
##                                           
##        'Positive' Class : 1               
## 

ROC curve for unpruned model

This model gives us an AUC of 92.59%.

## [1] "The AUC is:  0.925925925925926"

Training the model with pruning

Pre-pruning is also known as early stopping. As the name suggests, the stopping criteria are set as parameter values when building the rpart model. Below are some of the pre-pruning criteria that can be used; the tree stops growing as soon as it meets any of them, or when it reaches pure nodes.

The complexity parameter (cp) in rpart is the minimum improvement in the model required at each node. It is based on the cost complexity of the model and works as follows:

  • For the given tree, add up the misclassifications at every terminal node.
  • Multiply the number of splits by a penalty term (lambda) and add it to the total misclassification.
  • Lambda is determined through cross-validation and is not reported in R.
  • The cp we see using printcp() is lambda scaled by the misclassification rate of the overall data.

The cp value is a stopping parameter: it speeds up the search because splits that do not meet this criterion are pruned before the tree grows any further.
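The cost-complexity score described above can be sketched as follows (illustrative error counts and an assumed lambda, not values reported by rpart):

```r
# Cost-complexity of a subtree: total misclassifications at the terminal
# nodes plus a penalty lambda per split (toy numbers for illustration).
cost_complexity <- function(misclassified, n_splits, lambda) {
  misclassified + lambda * n_splits
}
# An unsplit root with 80 errors vs. a one-split tree with 3 errors:
cost_complexity(80, 0, lambda = 5)  # 80
cost_complexity(3, 1, lambda = 5)   # 8 -> the split is worth its penalty
```

A split is kept only when the drop in misclassification outweighs the lambda penalty it adds.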

Other parameters include but are not limited to:

  • maxdepth: This parameter is used to set the maximum depth of a tree.

  • minsplit: It is the minimum number of records that must exist in a node for a split to happen or be attempted.

Finally, since we are in a classification setting, we have to specify “class” as the method used for building our tree, instead of the “anova” method used in regression settings.

pruned_base_model <- rpart(Classes ~ Temperature+Rain+FFMC+DMC+DC+ISI+BUI+FWI+RH, data = train_set, method = "class",  control = rpart.control(cp = 0, maxdepth = 8, minsplit = 5))
summary(pruned_base_model)
## Call:
## rpart(formula = Classes ~ Temperature + Rain + FFMC + DMC + DC + 
##     ISI + BUI + FWI + RH, data = train_set, method = "class", 
##     control = rpart.control(cp = 0, maxdepth = 8, minsplit = 5))
##   n= 190 
## 
##       CP nsplit rel error xerror       xstd
## 1 0.9625      0    1.0000 1.0000 0.08506963
## 2 0.0125      1    0.0375 0.1000 0.03460301
## 3 0.0000      3    0.0125 0.0875 0.03245696
## 
## Variable importance
##  ISI FFMC  FWI  DMC  BUI   DC 
##   21   20   18   15   14   12 
## 
## Node number 1: 190 observations,    complexity param=0.9625
##   predicted class=1  expected loss=0.4210526  P(node) =1
##     class counts:    80   110
##    probabilities: 0.421 0.579 
##   left son=2 (83 obs) right son=3 (107 obs)
##   Primary splits:
##       ISI  < -0.4073884 to the left,  improve=86.84845, (0 missing)
##       FFMC < 0.1608133  to the left,  improve=86.79087, (0 missing)
##       FWI  < -0.4751507 to the left,  improve=72.30889, (0 missing)
##       DMC  < -0.5269618 to the left,  improve=49.40724, (0 missing)
##       BUI  < -0.6424139 to the left,  improve=46.02645, (0 missing)
##   Surrogate splits:
##       FFMC < 0.2967052  to the left,  agree=0.984, adj=0.964, (0 split)
##       FWI  < -0.5557897 to the left,  agree=0.947, adj=0.880, (0 split)
##       DMC  < -0.5269618 to the left,  agree=0.874, adj=0.711, (0 split)
##       BUI  < -0.5615897 to the left,  agree=0.858, adj=0.675, (0 split)
##       DC   < -0.6667463 to the left,  agree=0.832, adj=0.614, (0 split)
## 
## Node number 2: 83 observations,    complexity param=0.0125
##   predicted class=0  expected loss=0.03614458  P(node) =0.4368421
##     class counts:    80     3
##    probabilities: 0.964 0.036 
##   left son=4 (77 obs) right son=5 (6 obs)
##   Primary splits:
##       FFMC < 0.1608133  to the left,  improve=2.783133, (0 missing)
##       ISI  < -0.5036757 to the left,  improve=2.354561, (0 missing)
##       DC   < 1.451133   to the left,  improve=0.881898, (0 missing)
##       BUI  < 0.7597094  to the left,  improve=0.881898, (0 missing)
##       FWI  < -0.286993  to the left,  improve=0.881898, (0 missing)
##   Surrogate splits:
##       ISI < -0.5518194 to the left,  agree=0.952, adj=0.333, (0 split)
## 
## Node number 3: 107 observations
##   predicted class=1  expected loss=0  P(node) =0.5631579
##     class counts:     0   107
##    probabilities: 0.000 1.000 
## 
## Node number 4: 77 observations
##   predicted class=0  expected loss=0  P(node) =0.4052632
##     class counts:    77     0
##    probabilities: 1.000 0.000 
## 
## Node number 5: 6 observations,    complexity param=0.0125
##   predicted class=0  expected loss=0.5  P(node) =0.03157895
##     class counts:     3     3
##    probabilities: 0.500 0.500 
##   left son=10 (4 obs) right son=11 (2 obs)
##   Primary splits:
##       Temperature < -0.4554149 to the right, improve=1.5000000, (0 missing)
##       FFMC        < 0.2165638  to the right, improve=1.5000000, (0 missing)
##       DC          < -0.5891221 to the left,  improve=1.5000000, (0 missing)
##       ISI         < -0.5036757 to the left,  improve=1.5000000, (0 missing)
##       DMC         < -0.5915142 to the right, improve=0.3333333, (0 missing)
##   Surrogate splits:
##       DMC < -0.5915142 to the right, agree=0.833, adj=0.5, (0 split)
##       ISI < -0.4796039 to the left,  agree=0.833, adj=0.5, (0 split)
## 
## Node number 10: 4 observations
##   predicted class=0  expected loss=0.25  P(node) =0.02105263
##     class counts:     3     1
##    probabilities: 0.750 0.250 
## 
## Node number 11: 2 observations
##   predicted class=1  expected loss=0  P(node) =0.01052632
##     class counts:     0     2
##    probabilities: 0.000 1.000
#Plot Decision Tree
printcp(pruned_base_model)
## 
## Classification tree:
## rpart(formula = Classes ~ Temperature + Rain + FFMC + DMC + DC + 
##     ISI + BUI + FWI + RH, data = train_set, method = "class", 
##     control = rpart.control(cp = 0, maxdepth = 8, minsplit = 5))
## 
## Variables actually used in tree construction:
## [1] FFMC        ISI         Temperature
## 
## Root node error: 80/190 = 0.42105
## 
## n= 190 
## 
##       CP nsplit rel error xerror     xstd
## 1 0.9625      0    1.0000 1.0000 0.085070
## 2 0.0125      1    0.0375 0.1000 0.034603
## 3 0.0000      3    0.0125 0.0875 0.032457
rpart.plot(pruned_base_model)

The summary of our base model gives the details of each split: the number of observations, the value of the complexity parameter, the predicted class, the class counts with their probabilities, and the children of the node. It also details the primary splits that follow, with their percent improvement in prediction, as well as the surrogate splits that come later.

The resulting tree, as explained in the section above, is the smallest tree with the lowest misclassification loss. It is plotted with the split details and leaf-node classes.

The optimal cp value can be read from the output above.

Testing the pruned model

On the training set

preds_pruned = predict(pruned_base_model,train_set, type="class")
confusionMatrix(preds_pruned, train_set$Classes,
                mode = "everything",
                positive='1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  80   1
##          1   0 109
##                                          
##                Accuracy : 0.9947         
##                  95% CI : (0.971, 0.9999)
##     No Information Rate : 0.5789         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9892         
##                                          
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.9909         
##             Specificity : 1.0000         
##          Pos Pred Value : 1.0000         
##          Neg Pred Value : 0.9877         
##               Precision : 1.0000         
##                  Recall : 0.9909         
##                      F1 : 0.9954         
##              Prevalence : 0.5789         
##          Detection Rate : 0.5737         
##    Detection Prevalence : 0.5737         
##       Balanced Accuracy : 0.9955         
##                                          
##        'Positive' Class : 1              
## 

The training accuracy of our tree is 99.47%, with an F1 score of 99.54% and only one misclassification.

On the testing set

We obtain a 96.23% accuracy on our held-out test set, which is better than the unpruned tree, meaning we have successfully reduced over-fitting through pruning.

preds_pruned = predict(pruned_base_model,test_set, type="class")
confusionMatrix(preds_pruned, test_set$Classes,
                mode = "everything",
                positive='1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 26  2
##          1  0 25
##                                           
##                Accuracy : 0.9623          
##                  95% CI : (0.8702, 0.9954)
##     No Information Rate : 0.5094          
##     P-Value [Acc > NIR] : 3.976e-13       
##                                           
##                   Kappa : 0.9246          
##                                           
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 0.9259          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9286          
##               Precision : 1.0000          
##                  Recall : 0.9259          
##                      F1 : 0.9615          
##              Prevalence : 0.5094          
##          Detection Rate : 0.4717          
##    Detection Prevalence : 0.4717          
##       Balanced Accuracy : 0.9630          
##                                           
##        'Positive' Class : 1               
## 
ROC curve for pruned model

The resulting AUC is 96.29%.

## [1] "The AUC is:  0.962962962962963"

BAGGING

Bagging, or bootstrap aggregation, is an ensemble method that involves training the same algorithm many times on different subsets sampled from the training data. The final prediction is then aggregated across the predictions of all the sub-models (averaged for regression, majority-voted for classification). The two most popular bagging ensemble techniques are Bagged Decision Trees and Random Forest.
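The aggregation step can be sketched in base R (toy sub-model predictions assumed, one row per observation):

```r
# Majority vote across the predictions of several bagged sub-models.
votes <- rbind(c("fire", "fire", "not_fire"),      # observation 1
               c("fire", "not_fire", "not_fire"))  # observation 2
majority <- apply(votes, 1, function(v) names(which.max(table(v))))
majority  # "fire" "not_fire"
```

Each observation gets the class predicted by the largest number of sub-models.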

Bagged Decision Trees

This method performs best with algorithms that have high variance. The argument method="treebag" specifies the bagged-tree algorithm. We will train our model using 5-fold cross-validation repeated 5 times.

Training the model
control <- trainControl(method="repeatedcv", number=5, repeats=5)

bagCART_model <- train(Classes ~ Temperature+Rain+FFMC+DMC+DC+ISI+BUI+FWI+RH, data=train_set, method="treebag", metric="Accuracy", trControl=control)
Training accuracy

We achieved a perfect fit on the training set using bagged trees trained with 5-fold CV repeated 5 times.

preds_bag = predict(bagCART_model,train_set, type="raw")
confusionMatrix(preds_bag, train_set$Classes,
                mode = "everything",
                positive='1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  80   0
##          1   0 110
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9808, 1)
##     No Information Rate : 0.5789     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##               Precision : 1.0000     
##                  Recall : 1.0000     
##                      F1 : 1.0000     
##              Prevalence : 0.5789     
##          Detection Rate : 0.5789     
##    Detection Prevalence : 0.5789     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 1          
## 
Testing accuracy

The bagged model did not achieve perfect performance on unseen data, which leads us to believe it overfit the training data. A likely cause is high correlation between the bagged trees: since there is no randomization in the features used for each tree, the same strong predictors end up in every tree. To reduce this correlation, we will implement random forests next, which add randomization to the features selected for each tree. The accuracy we obtained is 94.34%.

preds_bag = predict(bagCART_model,test_set, type="raw")
confusionMatrix(preds_bag, as.factor(test_set$Classes),
                mode = "everything",
                positive='1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 26  3
##          1  0 24
##                                           
##                Accuracy : 0.9434          
##                  95% CI : (0.8434, 0.9882)
##     No Information Rate : 0.5094          
##     P-Value [Acc > NIR] : 6.652e-12       
##                                           
##                   Kappa : 0.887           
##                                           
##  Mcnemar's Test P-Value : 0.2482          
##                                           
##             Sensitivity : 0.8889          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.8966          
##               Precision : 1.0000          
##                  Recall : 0.8889          
##                      F1 : 0.9412          
##              Prevalence : 0.5094          
##          Detection Rate : 0.4528          
##    Detection Prevalence : 0.4528          
##       Balanced Accuracy : 0.9444          
##                                           
##        'Positive' Class : 1               
## 

Random Forests

Random Forest is an extension of bagged decision trees where, in addition to sampling the data, we also sample the variables for each tree. This reduces the correlation between the individual trees by ensuring that the same strong predictors are not used in every tree.
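The extra randomization can be sketched as follows (using this dataset's predictor names; the sqrt(p) default is the usual choice for classification forests):

```r
# Each tree (or split) considers only a random subset of mtry predictors.
predictors <- c("Temperature", "Rain", "FFMC", "DMC", "DC",
                "ISI", "BUI", "FWI", "RH")
mtry <- floor(sqrt(length(predictors)))  # classification default: sqrt(p) = 3
set.seed(1)
sample(predictors, mtry)                 # candidate features for one tree
```

Because strong predictors such as ISI or FFMC are left out of some subsets, the resulting trees are less correlated with one another.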

Training the model

control <- trainControl(method="repeatedcv", number=5, repeats=5)

rf_model <- train(Classes ~ Temperature+Rain+FFMC+DMC+DC+ISI+BUI+FWI+RH, data=train_set, method="rf", metric="Accuracy", trControl=control)

Testing the model

On the training set

Once again, our random forest model achieved a perfect fit on the training set with 5-fold CV repeated 5 times.

preds_rf = predict(rf_model,train_set, type="raw")
confusionMatrix(preds_rf, train_set$Classes,
                mode = "everything",
                positive='1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  80   0
##          1   0 110
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9808, 1)
##     No Information Rate : 0.5789     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##               Precision : 1.0000     
##                  Recall : 1.0000     
##                      F1 : 1.0000     
##              Prevalence : 0.5789     
##          Detection Rate : 0.5789     
##    Detection Prevalence : 0.5789     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 1          
## 

On the testing set

On our dataset, the random forest did not outperform the previous models, yielding an accuracy of 96.23%, the same as the pruned tree.

preds_rf = predict(rf_model,test_set, type="raw")
confusionMatrix(preds_rf, as.factor(test_set$Classes),
                mode = "everything",
                positive='1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 26  2
##          1  0 25
##                                           
##                Accuracy : 0.9623          
##                  95% CI : (0.8702, 0.9954)
##     No Information Rate : 0.5094          
##     P-Value [Acc > NIR] : 3.976e-13       
##                                           
##                   Kappa : 0.9246          
##                                           
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 0.9259          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9286          
##               Precision : 1.0000          
##                  Recall : 0.9259          
##                      F1 : 0.9615          
##              Prevalence : 0.5094          
##          Detection Rate : 0.4717          
##    Detection Prevalence : 0.4717          
##       Balanced Accuracy : 0.9630          
##                                           
##        'Positive' Class : 1               
## 

BOOSTING

In boosting, multiple models are trained sequentially and each model learns from the errors of its predecessors. We will use the Stochastic Gradient Boosting algorithm.
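The sequential idea can be illustrated with a toy sketch (a regression target and a constant weak learner for simplicity, not the gbm algorithm itself):

```r
# Each boosting stage fits the residuals left by the previous stages;
# the ensemble prediction is the shrunken sum of the stage predictions.
y <- c(3, 5, 7, 9)
pred <- rep(0, length(y))
for (m in 1:50) {
  residuals <- y - pred
  pred <- pred + 0.1 * mean(residuals)  # weak learner: a single constant
}
round(pred, 2)  # converges towards mean(y) = 6
```

The shrinkage factor (0.1 here) slows learning down, which is why boosting needs many iterations, as noted below.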

Stochastic Gradient Boosting

Training the model

An important thing to note is that stochastic gradient boosting takes much longer to train, as it is a stage-wise method that requires many iterations to converge. Adding cross-validation lengthens training even further.

control <- trainControl(method="cv", number=5)

SGB <- train(Classes ~ Temperature+Rain+FFMC+DMC+DC+ISI+BUI+FWI+RH, data=train_set, method="gbm", metric="Accuracy", trControl=control)

Testing the model

On the training set

We obtain 99.47% accuracy on the training data, with a single misclassification.

preds_sgb = predict(SGB,train_set, type="raw")
confusionMatrix(preds_sgb, train_set$Classes,
                mode = "everything",
                positive='1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  79   0
##          1   1 110
##                                          
##                Accuracy : 0.9947         
##                  95% CI : (0.971, 0.9999)
##     No Information Rate : 0.5789         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9892         
##                                          
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 1.0000         
##             Specificity : 0.9875         
##          Pos Pred Value : 0.9910         
##          Neg Pred Value : 1.0000         
##               Precision : 0.9910         
##                  Recall : 1.0000         
##                      F1 : 0.9955         
##              Prevalence : 0.5789         
##          Detection Rate : 0.5789         
##    Detection Prevalence : 0.5842         
##       Balanced Accuracy : 0.9938         
##                                          
##        'Positive' Class : 1              
## 

On the testing set

We obtain 98.11% accuracy on unseen data: boosting brought a clear improvement over our random forest model, even though it required a longer training time.

preds_sgb = predict(SGB,test_set, type="raw")
confusionMatrix(preds_sgb, as.factor(test_set$Classes),
                mode = "everything",
                positive='1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 26  1
##          1  0 26
##                                           
##                Accuracy : 0.9811          
##                  95% CI : (0.8993, 0.9995)
##     No Information Rate : 0.5094          
##     P-Value [Acc > NIR] : 1.556e-14       
##                                           
##                   Kappa : 0.9623          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9630          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9630          
##               Precision : 1.0000          
##                  Recall : 0.9630          
##                      F1 : 0.9811          
##              Prevalence : 0.5094          
##          Detection Rate : 0.4906          
##    Detection Prevalence : 0.4906          
##       Balanced Accuracy : 0.9815          
##                                           
##        'Positive' Class : 1               
## 

Plotting the ROC curve

## [1] "The AUC is:  0.981481481481481"

Final notes on Ensemble methods

  • Our basic model already yields good performance on unseen data, but ensemble methods such as random forests and boosting noticeably improve predictive accuracy.
  • The results depend on the training set: changing the random seed used in the train-test split, for instance, produces small fluctuations in train and test accuracy. We observed this during our experiments, which gave different values on different machines and for different seeds.

SVM

The Support Vector Machine is a discriminative classifier that separates observations using the hyper-plane that best differentiates between the classes. Its advantages lie in its flexibility and its good performance on high-dimensional data.

We will use SVM on our dataset to demonstrate its capabilities.

Training the model

The goal of the SVM is to identify the boundary that maximizes the margin, i.e. the distance between the hyper-plane and the closest points of each class.

There are two hyper-parameters to consider before training our SVM model. The first is the cost C, which acts as a regularization parameter and trades off correct classification of the training examples against maximization of the margin: the greater the value of C, the more heavily training errors are penalized, so fewer misclassifications are tolerated on the training set. The second, gamma, defines how much curvature we allow in the decision boundary.
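The trade-off controlled by C can be sketched via the soft-margin objective (toy labels, decision values, and an assumed weight norm, purely for illustration):

```r
# Soft-margin objective: 0.5 * ||w||^2 + C * sum of hinge losses.
hinge <- function(y, f) pmax(0, 1 - y * f)
y <- c(1, -1, 1)           # true labels coded +1 / -1
f <- c(0.8, -1.2, -0.3)    # assumed decision values f(x); the third is wrong
w_norm2 <- 1               # assumed ||w||^2
obj <- function(C) 0.5 * w_norm2 + C * sum(hinge(y, f))
obj(1)    # small C: the margin term matters relatively more
obj(100)  # large C: the misclassified point dominates the objective
```

With a large C, the optimizer would bend the boundary to fix the third point; with a small C it would rather keep a wide margin and accept the error.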

We start by tuning our model over different values of gamma and C, trying both a radial and a linear kernel.

To use the cross-validation functions from the caret package, we need to turn the 0/1 values of the variable Classes into “fire”/“not_fire” (as required by the function). These functions will find the best values for both gamma and C; we used a tuning length of 10.

train_set_svm <- train_set
test_set_svm <- test_set

train_set_svm$Classes <- mapvalues(train_set$Classes, from=c(0,1), to=c("not_fire","fire"))
test_set_svm$Classes <- mapvalues(test_set$Classes, from=c(0,1), to=c("not_fire","fire"))

We perform 10-fold cross-validation repeated 5 times.

control = trainControl(method = "repeatedcv", repeats=5, summaryFunction=twoClassSummary, classProbs=TRUE)

svm_model_radial <- train(Classes ~ Temperature+Rain+FFMC+DMC+DC+ISI+BUI+FWI+RH, data = train_set_svm, method="svmRadial", tuneLength=10, metric="ROC", trControl=control)

svm_model_linear <- train(Classes ~ Temperature+Rain+FFMC+DMC+DC+ISI+BUI+FWI+RH, data = train_set_svm, method="svmLinear", tuneLength=10, metric="ROC", trControl=control)

svm_model_radial
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 190 samples
##   9 predictor
##   2 classes: 'not_fire', 'fire' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 171, 171, 171, 171, 171, 171, ... 
## Resampling results across tuning parameters:
## 
##   C       ROC        Sens    Spec     
##     0.25  0.9840909  0.9150  0.9309091
##     0.50  0.9879545  0.9100  0.9309091
##     1.00  0.9918182  0.9375  0.9545455
##     2.00  0.9920455  0.9450  0.9600000
##     4.00  0.9913636  0.9600  0.9690909
##     8.00  0.9947727  0.9575  0.9654545
##    16.00  0.9952273  0.9600  0.9654545
##    32.00  0.9963636  0.9625  0.9727273
##    64.00  0.9965909  0.9650  0.9727273
##   128.00  0.9965909  0.9600  0.9727273
## 
## Tuning parameter 'sigma' was held constant at a value of 0.2096875
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.2096875 and C = 64.
svm_model_linear
## Support Vector Machines with Linear Kernel 
## 
## 190 samples
##   9 predictor
##   2 classes: 'not_fire', 'fire' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 171, 171, 171, 171, 171, 171, ... 
## Resampling results:
## 
##   ROC        Sens    Spec     
##   0.9965909  0.9525  0.9690909
## 
## Tuning parameter 'C' was held constant at a value of 1

From the results above we can see the best values found: sigma = 0.2096875 and C = 64 for the radial kernel, while for the linear kernel C was held constant at 1.

We then show the confusion matrices of both models, first on the training set and then on the test set.

preds_svm_train_linear <- predict(svm_model_linear, train_set_svm)
preds_svm_train_radial <- predict(svm_model_radial, train_set_svm)
  
confusionMatrix(preds_svm_train_linear,train_set_svm$Classes, positive="fire")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction not_fire fire
##   not_fire       79    1
##   fire            1  109
##                                           
##                Accuracy : 0.9895          
##                  95% CI : (0.9625, 0.9987)
##     No Information Rate : 0.5789          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9784          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9909          
##             Specificity : 0.9875          
##          Pos Pred Value : 0.9909          
##          Neg Pred Value : 0.9875          
##              Prevalence : 0.5789          
##          Detection Rate : 0.5737          
##    Detection Prevalence : 0.5789          
##       Balanced Accuracy : 0.9892          
##                                           
##        'Positive' Class : fire            
## 
confusionMatrix(preds_svm_train_radial,train_set_svm$Classes, positive="fire")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction not_fire fire
##   not_fire       80    0
##   fire            0  110
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9808, 1)
##     No Information Rate : 0.5789     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.5789     
##          Detection Rate : 0.5789     
##    Detection Prevalence : 0.5789     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : fire       
## 
preds_svm_test_linear <- predict(svm_model_linear, test_set_svm)
preds_svm_test_radial <- predict(svm_model_radial, test_set_svm)

confusionMatrix(preds_svm_test_linear,test_set_svm$Classes, positive="fire")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction not_fire fire
##   not_fire       26    1
##   fire            0   26
##                                           
##                Accuracy : 0.9811          
##                  95% CI : (0.8993, 0.9995)
##     No Information Rate : 0.5094          
##     P-Value [Acc > NIR] : 1.556e-14       
##                                           
##                   Kappa : 0.9623          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9630          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9630          
##              Prevalence : 0.5094          
##          Detection Rate : 0.4906          
##    Detection Prevalence : 0.4906          
##       Balanced Accuracy : 0.9815          
##                                           
##        'Positive' Class : fire            
## 
confusionMatrix(preds_svm_test_radial,test_set_svm$Classes, positive="fire")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction not_fire fire
##   not_fire       24    3
##   fire            2   24
##                                           
##                Accuracy : 0.9057          
##                  95% CI : (0.7934, 0.9687)
##     No Information Rate : 0.5094          
##     P-Value [Acc > NIR] : 7.924e-10       
##                                           
##                   Kappa : 0.8114          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8889          
##             Specificity : 0.9231          
##          Pos Pred Value : 0.9231          
##          Neg Pred Value : 0.8889          
##              Prevalence : 0.5094          
##          Detection Rate : 0.4528          
##    Detection Prevalence : 0.4906          
##       Balanced Accuracy : 0.9060          
##                                           
##        'Positive' Class : fire            
## 

On the training data, the radial kernel outperformed the linear one because it fit the training data more closely, yielding perfect training accuracy. On the test data, however, the linear kernel performed better, suggesting that it generalizes better.

Plotting the ROC curve

The model with the linear kernel gave us an AUC of 98.15%.

## [1] "The AUC is:  0.981481481481481"

However, the radial kernel gave us a lower value of 90.6%.

## [1] "The AUC is:  0.905982905982906"
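
The code that produced these AUC values is not shown here; the following is a minimal sketch of how the test-set ROC curves could be plotted with the pROC package (an assumption on our part, as the original may have used a different utility):

```r
# Sketch only: plot test-set ROC curves for both kernels using pROC.
# predict(..., type = "prob") is available because classProbs = TRUE
# was set in trainControl above.
library(pROC)

probs_linear <- predict(svm_model_linear, test_set_svm, type = "prob")
probs_radial <- predict(svm_model_radial, test_set_svm, type = "prob")

roc_linear <- roc(test_set_svm$Classes, probs_linear$fire,
                  levels = c("not_fire", "fire"))
roc_radial <- roc(test_set_svm$Classes, probs_radial$fire,
                  levels = c("not_fire", "fire"))

plot(roc_linear, col = "blue", main = "ROC Curves (test set)")
lines(roc_radial, col = "red")
legend("bottomright", legend = c("linear kernel", "radial kernel"),
       col = c("blue", "red"), lty = 1)
```

Note that the AUC values printed above equal the balanced accuracies from the confusion matrices, which suggests they were computed on the hard class predictions rather than on class probabilities; probability-based curves will generally differ slightly.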

Comparing the Models

To compare across our models, we created a function that computes the relevant metrics and stores them in a dataframe. The metrics we are concerned with are:

  1. accuracy
  2. F1 score
  3. AUC
  4. error rate
  5. precision
  6. recall
  7. specificity
  8. balanced accuracy
  9. false positive rate and false negative rate

#Import metrica package that gives access to a variety of performance metrics in R
library(metrica)
## 
## Attaching package: 'metrica'
## The following objects are masked from 'package:caret':
## 
##     MAE, precision, R2, recall, RMSE, specificity
## The following object is masked from 'package:lattice':
## 
##     barley
library(cvAUC)
#First we initialize a vector with our column names
columns = c("Model","Accuracy", "F1-Score", "AUC_roc","Error_Rate","Precision","Recall", "Specificity", "Balanced Accuracy", "FPR", "FNR")
#Now we define a function that takes in the model name, the observed classes from our test set, and the predictions
comparison_metrics <- function(model_name, obs, pred) {
  obs = as.numeric(obs)
  pred = as.numeric(pred)
   acc <- accuracy(obs = obs,pred = pred)
   f1 <- fscore(obs = obs,pred = pred)
   auc <- AUC(pred, obs)
   errate <- error_rate(obs = obs,pred = pred)
   precision <- precision(obs = obs,pred = pred)
   recall <- recall(obs = obs,pred = pred)
   specificity <- specificity(obs = obs,pred = pred)
   ba <- balacc(obs = obs,pred = pred)
   fpr <- FPR(obs = obs,pred = pred)
   fnr <- FNR(obs = obs,pred = pred)
   
   return(c(model_name,acc,f1,auc,errate,precision,recall,specificity,ba,fpr,fnr))
}
comparison_df = data.frame(matrix(nrow = 0, ncol = length(columns)))
colnames(comparison_df) = columns

We then insert the metrics according to each model row by row into our dataframe.

comparison_df[nrow(comparison_df) + 1, ] <- comparison_metrics("Logistic regression", test_set$Classes, preds_logistic)
comparison_df[nrow(comparison_df) + 1, ] <- comparison_metrics("LDA", test_set$Classes, preds_lda$class )
comparison_df[nrow(comparison_df) + 1, ] <- comparison_metrics("QDA", test_set$Classes, preds_qda$class)
comparison_df[nrow(comparison_df) + 1, ] <- comparison_metrics("KNN", test_set$Classes,preds_knn)
comparison_df[nrow(comparison_df) + 1, ] <- comparison_metrics("Unpruned_Trees", test_set$Classes, preds_unpruned)
comparison_df[nrow(comparison_df) + 1, ] <- comparison_metrics("Pruned_Trees", test_set$Classes, preds_pruned)
comparison_df[nrow(comparison_df) + 1, ] <- comparison_metrics("Bagged_Trees", test_set$Classes, preds_bag)
comparison_df[nrow(comparison_df) + 1, ] <- comparison_metrics("Random_Forest", test_set$Classes, preds_rf)
comparison_df[nrow(comparison_df) + 1, ] <- comparison_metrics("Boosted_Trees", test_set$Classes, preds_sgb)
comparison_df[nrow(comparison_df) + 1, ] <- comparison_metrics("SVM_linear", test_set$Classes, preds_svm_test_linear)
comparison_df[nrow(comparison_df) + 1, ] <- comparison_metrics("SVM_radial", test_set$Classes, preds_svm_test_radial)

This results in the following dataframe.

comparison_df
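
One caveat (our addition, not part of the original code): because comparison_metrics returns its values through c(), which coerces everything to character, every column of comparison_df ends up stored as character strings. Before sorting or plotting the numeric metrics, they likely need to be converted back, for example:

```r
# c() coerces all returned metrics to character strings, so convert the
# numeric columns back to numeric before sorting and plotting.
num_cols <- setdiff(columns, "Model")
comparison_df[num_cols] <- lapply(comparison_df[num_cols], as.numeric)
```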

Visualizing the results

We sorted the outcomes of the AUC calculations and visualized them by model in a bar chart from least to greatest.

comparison_df_sorted <- comparison_df[order(-comparison_df$AUC_roc),]
comparison_df_sorted %>%
  ggplot(aes(x=reorder(Model,+AUC_roc), y=AUC_roc, fill=Model)) +
  theme(axis.text.x=element_blank()) +
  labs(x = "Models Tested", y = "AUC Value") +
  geom_col(position="dodge")

We also visualized the models in terms of accuracy below

comparison_df_sorted <- comparison_df[order(-comparison_df$Accuracy),]
comparison_df_sorted %>%
  ggplot(aes(x=reorder(Model,+Accuracy), y=Accuracy, fill=Model)) +
  theme(axis.text.x=element_blank()) +
  labs(x = "Models Tested", y = "Accuracy") +
  geom_col(position="dodge")

Observations

  1. We can see that both the bagged and boosted trees are the best performing in terms of AUC out of all the models tested. This indicates that these ensemble methods were the best at learning the significant features in our dataset.
  2. The LDA model performed better than logistic regression, which suggests that the observations were likely drawn from an approximately Gaussian distribution, giving LDA an advantage. LDA also outperformed QDA, which suggests first that the decision boundary is close to linear, and second that QDA overfit the training set, leading to poorer generalization.
  3. One additional indicator that the decision boundary is linear is the fact that the radial SVM kernel performed worse than the linear one.
  4. The unpruned tree model was the least effective, as it did not do a good job of learning the key features in the dataset. The random forest model was also not very effective, likely because the dataset is small and this kind of ensemble tends to benefit from a larger number of observations.